Biodiversity of Philippine marine fishes: A DNA barcode reference library based on voucher specimens

Accurate identification of fishes is essential for understanding their biology and for ensuring food safety for consumers. DNA barcoding is an important tool because it can verify identifications of both whole and processed fishes that have had key morphological characters removed (e.g., filets, fish meal); however, DNA reference libraries are incomplete, and public repositories for sequence data contain incorrectly identified sequences. During a nine-year sampling program in the Philippines, a global biodiversity hotspot for marine fishes, we developed a verified reference library of cytochrome c oxidase subunit I (COI) sequences for 2,525 specimens representing 984 species. Specimens were primarily purchased from markets, with additional diversity collected using rotenone or fishing gear. Species identifications were verified based on taxonomic, phenotypic, and genotypic data, and sequences are associated with voucher specimens, live-color photographs, and genetic samples cataloged at the Smithsonian Institution, National Museum of Natural History. The Biodiversity of Philippine Marine Fishes dataset is released herein to increase knowledge of species diversity and distributions and to facilitate accurate identification of market fishes.


Background & Summary
In 2007, the United States Senate requested that the Government Accountability Office (GAO 1) investigate several federal agencies responsible for seafood and determine actions necessary to detect and prevent seafood fraud. GAO (2009) found that Americans consumed almost 5 billion pounds of seafood annually, more than 80% of which was imported, and determined that current practices were not sufficient to assure consumers that the seafood they purchased was correctly labeled. The Presidential Task Force on Combating Illegal, Unreported, and Unregulated (IUU) Fishing and Seafood Fraud identified products like grouper and snapper as particularly at risk because the same species are marketed under different names in different countries, and extensive processing, common for imported products, removes diagnostic characters required for morphological species identification 2. As part of its summary report, GAO 1 recommended "a federal agency wide library of seafood species standards" be developed. In response, the United States Food and Drug Administration (FDA), which is responsible for the identification of seafood distributed through interstate commerce, established a collaboration with the Smithsonian Institution's National Museum of Natural History (NMNH), Division of Fishes and Laboratories of Analytical Biology (LAB) to develop a DNA-barcode reference library for commercial seafood as a regulatory tool for species identification [3][4][5].
During the past 30 years, genetic sequencing has improved the identification of global biodiversity. Commonly known as DNA barcoding, sequencing one or more loci and comparing these data to those in reference libraries is an efficient way to identify or verify species. DNA barcoding has been used to identify species diversity of a region (e.g. [6][7][8][9]), describe new species (e.g. 10), link different ontogenetic stages (e.g. 11,12), detect illegally traded wildlife (e.g. 13), and confirm the identity of animals sold in markets (e.g. 14). The FDA now uses DNA barcoding as its primary tool for species identification of seafood products and has used this technique successfully both in seafood-related-illness investigations 15,16 and in cases of species substitution for economic gain 17.
Success of species-level identification using barcodes depends on completeness and accuracy of DNA reference libraries 8,18,19. Public sequence repositories such as GenBank (https://www.ncbi.nlm.nih.gov) and the Barcode of Life Database (BOLD; https://www.boldsystems.org/) are essential resources for DNA barcoding, yet they contain misidentified sequences (e.g. [20][21][22][23]). Misidentifications can happen because some species are difficult to identify morphologically or because of errors in data management. In other cases, sequence data do not align with currently accepted morphological identifications for legitimate reasons including cryptic diversity, mixed genealogies between sister species, incomplete lineage sorting, and introgressive hybridization 8. Problematic sequences in public repositories are difficult to resolve when they lack associated voucher specimens or complete collection data, or when no protocol exists for revising identifications. Thus, reference libraries greatly increase in scientific and practical value when sequences are linked to cataloged voucher specimens in natural history museums that can be reexamined to verify or revise identifications.
With more than 7,600 islands and 36,000 kilometers of coastline in the Coral Triangle, the Philippines is the epicenter of marine shorefish biodiversity and home to more than 2,600 species of marine fishes 24-26. Shorefish biodiversity is a critical resource to Filipinos: the Philippines ranks eleventh in global marine capture production 27, and Philippine fish markets are some of the most species-diverse markets in the world 10. Roughly 70% of Filipino nutritional protein comes from fishes and more than 1.6 million people depend on the fishing industry for their livelihood 28. Not all fishes caught are sold within the country; the Philippines exported $46 million in fish filets and other fish meat in 2021 and 21% of that was imported by the United States 29. Given the remarkable biodiversity of the region and the high species diversity sold in markets, accurate identification of seafood is essential to food safety and management for both the Philippines and the countries that import fish products from the Philippines. To inventory species sold in Philippine fish markets and additional diversity of the region, the Bureau of Fisheries and Aquatic Resources-National Fisheries Research and Development Institute (BFAR-NFRDI), Department of Agriculture, Philippines, and the NMNH developed a collaboration in 2011 to generate a genetic reference library based on voucher specimens to quickly and efficiently identify fishes, regardless of whether they are whole or processed (e.g., filets, fishmeal).
Fig. 1 Map of sampling localities. For fishes purchased from markets, the market location is shown; vendors stated that fishes were captured from local coastal waters in the vicinity of the markets. For fishes collected in the field using rotenone while snorkeling, SCUBA diving, or using fishing gear, points on the map represent precise collection localities.
Our dataset, Biodiversity of Philippine Marine Fishes, is the result of nine years (2011-2019) of sampling (Fig. 1) and includes 2,525 specimens representing 984 species. Seventy-seven percent of specimens in our dataset were purchased from fish markets; the remaining 23% were collected using rotenone or fishing gear to capture additional diversity of the region (Table 1). Specimens were sequenced for a ~655 base pair portion of the mitochondrial cytochrome c oxidase I locus (COI) to develop a barcode reference library linked to voucher specimens (Fig. 2). The dataset represents the most comprehensive DNA reference library of Philippine marine fishes, and includes vouchered museum specimens, collection data, live color photographs (e.g., Figs. 3, 4), and genetic samples for future analyses. Species identifications were verified based on taxonomic, phenotypic, and genotypic data (see Methods and Technical Validation sections; Fig. 2). Among the 135 families of fishes included in the verified dataset, the families Labridae (78 species), Gobiidae (69 species), Epinephelidae (53 species), and Pomacentridae (53 species) have the greatest species diversity (Fig. 5a). The dataset includes sequences for 55 species in 25 families that were not previously publicly available on GenBank or BOLD for any locus, and an additional 29 species in 19 families that represent the first publicly available COI barcode sequence for the taxon (as of September 28, 2022; Table 1, Fig. 5b; see Verified specimen records deposited at FigShare 30). The Biodiversity of Philippine Marine Fishes dataset represents ~50% of the estimated Philippine market fish diversity 10 and will serve as a foundational checklist and reference library to improve knowledge of species diversity and distributions, to identify and describe new species, and to confirm the identity of commercially caught fishes in this global biodiversity hotspot.

Table 1 column headings: Purchased from market; Collected in field by rotenone or fishing gear; Combined datasets.
Table 1. Overview of 2,525 verified specimens in the dataset, highlighting that different collecting methods are important for completing reference libraries. Most species (760) were purchased from markets, 340 species were collected in the field, and 116 species were collected both from markets and in the field. The majority of newly sequenced species (60%) and of species newly sequenced for COI (69%) were collected in the field, likely because market collections emphasized commercial species with large maximum sizes, whereas less well-known cryptobenthic fishes were more often collected in the field.
Fig. 3 Photographs of groupers (Epinephelus), a challenging genus to identify and one that is frequently mislabeled. Live-color photographs, voucher specimens, and molecular sequence data enabled verified identifications of these specimens collected from Philippine fish markets.

Methods
Specimen collection.
Between 2011 and 2019, we collected and verified 2,525 specimens, representing 135 families, 445 genera, and 984 species of fishes (see Verified specimen records deposited at FigShare 30). Of these, 1,956 (77%) were purchased from fish landings, roadside stalls, and municipal and city markets (155 localities; Fig. 1; see Verified specimen records deposited at FigShare 30). The remaining 569 fishes included in the dataset were collected from near-shore habitats using rotenone while snorkeling or SCUBA diving, or using fishing gear (71 localities; Fig. 1
field to capture live colors by J. T. Williams using Fujifilm and Nikon camera bodies with 105 mm macro lenses under flash or LED daylight lighting.
Tissue samples from each specimen were preserved in ethanol and M2 lysis buffer (AutoGen Inc.), and whole voucher specimens were fixed using 10% formalin. Voucher specimens were transferred to 75% ethanol for long-term storage and cataloged in the NMNH Fish Collection, and associated tissues and extractions were archived in the NMNH Biorepository (see Verified specimen records deposited at FigShare 30). Specimen lengths used herein are reported as standard length (SL).
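The collection counts reported above and in the Table 1 caption can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only; it assumes that the 760 market species and the 340 field species each include the 116 species obtained by both methods, which is the reading under which the totals agree with the 984 species reported.

```python
# Counts as reported in the text and the Table 1 caption.
market_specimens = 1956   # purchased from landings, stalls, and markets
field_specimens = 569     # collected with rotenone or fishing gear
total_specimens = market_specimens + field_specimens

species_market = 760      # species purchased from markets
species_field = 340       # species collected in the field
species_both = 116        # ASSUMPTION: already counted in both figures above

# Inclusion-exclusion: species obtained by both methods are counted once.
total_species = species_market + species_field - species_both

# Shares of specimens by collecting method, rounded to whole percent.
market_share = round(100 * market_specimens / total_specimens)
field_share = round(100 * field_specimens / total_specimens)

print(total_specimens, total_species, market_share, field_share)
```

Under this reading, the specimen and species totals match the 2,525 specimens and 984 species given in the text.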

Specimen identification and validation.
Morphological identifications were made in the field on fresh specimens, and preserved specimens were further examined at NMNH by J. T. Williams, K. E. Carpenter, K. E. Bemis, M. G. Girard, and D. E. Pitassy using global, regional, or taxon-specific keys (e.g. 25,31,32). Voucher identities were verified using molecular characters by comparing newly generated sequences with those available through public repositories (e.g., GenBank, BOLD) as described in the Technical Validation section below and in Fig. 2. To obtain DNA sequence data, DNA was extracted from tissues using the AutoGenPrep 965 (AutoGen, Holliston, Massachusetts, USA) following manufacturer protocols. We targeted the 655 base-pair (bp) barcode of COI following Weigt et al. 7 using the primers from Baldwin et al. 33. PCR products were sequenced on an Automated ABI 3730xl at the NMNH; sequence trace files were trimmed of low-quality ends, and forward and reverse reads were assembled into contigs using Sequencher 5.4 (Gene Codes). Resulting sequences were 515-686 bp in length (mean: 651.7, median: 655, mode: 655, standard deviation: 12.7). Sequence alignments were performed using MAFFT 7.475 34 within Geneious 11.1.5 35.
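Length summaries of the kind reported above (range, mean, median, mode, standard deviation) can be reproduced from a list of contig lengths. The following is a minimal sketch using Python's standard library; the lengths shown are hypothetical placeholders, not values from the dataset, and the use of the sample standard deviation is an assumption because the text does not say which estimator was used.

```python
import statistics

# HYPOTHETICAL contig lengths in bp, for illustration only.
lengths = [655, 655, 652, 640, 655, 686, 515, 655]

summary = {
    "min": min(lengths),
    "max": max(lengths),
    "mean": round(statistics.mean(lengths), 1),
    "median": statistics.median(lengths),
    "mode": statistics.mode(lengths),
    # ASSUMPTION: sample standard deviation (n - 1 denominator).
    "stdev": round(statistics.stdev(lengths), 1),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In practice, the lengths would be read from the assembled contigs rather than typed in by hand.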

Technical Validation
In addition to morphological identification, all vouchers were validated based on molecular characters using COI data from BOLD. The BOLD database was used for verification because it has more barcode sequences (when including those that are "private") than are available on GenBank and because barcodes published on GenBank are actively extracted and included within the BOLD database. Sequences generated from voucher specimens were imported into Geneious and multiple-sequence alignments were generated using MAFFT for each morphologically identified taxon. All taxon-specific alignments were checked individually to ensure each represented a single taxon and to identify divergent sequences and associated vouchers. Sequences were considered to be the same taxon if sequence identity was ≥97.5%. Sequences with similarity ≤97.4% were aligned with additional sequences, and their vouchers were examined to determine a revised identification.
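The ≥97.5% identity criterion can be illustrated with a short sketch. This is not the Geneious implementation; it assumes the two sequences are rows of the same alignment (equal length) and that columns with a gap in either sequence are skipped, which is one of several possible conventions. The function names are hypothetical.

```python
THRESHOLD = 97.5  # percent identity criterion stated in the text


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be rows of the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ASSUMPTION: gapped columns are ignored
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_taxon(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """True when the pair meets the identity criterion."""
    return percent_identity(a, b) >= threshold


# Toy 40-column alignment rows (not real COI data).
reference = "ACGT" * 10
one_mismatch = "T" + reference[1:]
two_mismatches = "TT" + reference[2:]

print(same_taxon(reference, one_mismatch))
print(same_taxon(reference, two_mismatches))
```

With the toy rows, one mismatch in 40 columns meets the criterion and two mismatches do not, so the second pair would be flagged for re-examination of the vouchers.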
Once each multiple-sequence alignment represented a single taxon based on our criteria, a representative sequence was submitted to the BOLD SYSTEMS Identification Engine via the web portal and searched against "All Barcode Records on BOLD." The results, including the 101-terminal Neighbor-Joining (NJ) phylogeny, were examined to confirm identity of submitted sequences based on monophyletic groups. In total, 2,371 of the 2,525 (93.9%) sequences in our dataset were confirmed using publicly available sequences (see Verified specimen records deposited at FigShare 30). For the remaining 154 sequences, representing 84 species, a matching COI barcode was not publicly available on GenBank or BOLD. These 154 sequences represent the first publicly available barcode for their identified taxon. Sequences in this category are labeled either "Newly sequenced" (no sequence for any locus was previously publicly available for the species) or "Newly sequenced for COI" (sequences were publicly available for one or more other loci, but not for COI) (Fig. 5b; see Verified specimen records deposited at FigShare 30; determinations as of 28 September 2022).
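The categories and counts in this section can be summarized in a small sketch. The function below is a hypothetical helper, and the label "Confirmed" is introduced here for illustration only; the two "Newly sequenced" labels follow the dataset's own terminology, and the species counts (55 and 29) are those given in the Background & Summary.

```python
def novelty_category(has_public_coi: bool, has_public_other_locus: bool) -> str:
    """Assign a species to a category from public sequence availability."""
    if has_public_coi:
        return "Confirmed"  # hypothetical label, not used in the dataset
    if has_public_other_locus:
        return "Newly sequenced for COI"
    return "Newly sequenced"


# Counts reported in the text.
total_sequences = 2525
confirmed_sequences = 2371
unmatched_sequences = total_sequences - confirmed_sequences
confirmed_percent = round(100 * confirmed_sequences / total_sequences, 1)

newly_sequenced_species = 55          # no public sequence for any locus
newly_sequenced_coi_species = 29      # public data for other loci only
unmatched_species = newly_sequenced_species + newly_sequenced_coi_species

print(unmatched_sequences, confirmed_percent, unmatched_species)
```

The derived values agree with the 154 sequences, 93.9%, and 84 species stated above.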

Usage Notes
The Biodiversity of Philippine Marine Fishes dataset is freely available to use for DNA barcoding or metabarcoding surveys, specimen identification, or other purposes (see Data Records). Additional specimens were collected that could not be verified based on currently available taxonomic, phenotypic, and genotypic information. Although not included in the verified sequence dataset released herein, these additional voucher specimens all have associated collection metadata, live-color photographs, and archived genetic samples available for future analyses (see Unverified specimen records deposited at FigShare 30 ). As our understanding of the taxonomy of fishes in the Philippines increases, additional sequences from these collections will be verified and incorporated into the GenBank and BOLD projects.

Code availability
No custom code was used.